2023-11-16
Today's topics:

- What does type.convert() do?
- na.omit vs. na.exclude vs. na.delete
- Fitting lm using the mice package (MICE = Multiple Imputation through Chained Equations)
Jonas Kristoffer Lindelov has built a terrific resource to explain this at
https://lindeloev.github.io/tests-as-linear/
What’s the point?
Analyses D-E are more commonly thought about in the context of generalized linear models, as we’ll see in 432.
Let’s open a small, simulated data set with 100 subjects and some missing values.
sim1 <- read_csv("c22/data/c22_sim1.csv") |>
type.convert(as.is = FALSE, na.strings = "NA")
head(sim1)

# A tibble: 6 × 6
subject out_q out_b pred1 pred2 pred3
<fct> <dbl> <fct> <dbl> <dbl> <fct>
1 S001 81.1 Yes 8.8 20.5 Middle
2 S002 105. No 7.1 24.9 High
3 S003 NA <NA> 9.9 17.4 Middle
4 S004 NA No 8.9 31.8 <NA>
5 S005 75.9 <NA> NA 22 High
6 S006 79.8 No 9.7 NA <NA>
What does type.convert() do? It tries to convert each column (individually) to either logical, integer, numeric, complex or (if a character vector) to factor.

- F, T, FALSE, TRUE or NA values are made into logical.
- The na.strings parameter adds missing-value strings (default = "NA").
- as.is = FALSE converts characters to factors; as.is = TRUE is the default.

The sim1 data:

| Variable | Description |
|---|---|
| subject | Subject identifier |
| out_q | Quantitative outcome |
| out_b | Binary outcome with levels Yes, No |
| pred1 | Predictor 1 (quantitative) |
| pred2 | Predictor 2 (also quantitative) |
| pred3 | Predictor 3 (categories are Low, Middle, High) |
Every variable except subject has missing values (this summary matches naniar's miss_var_summary() output):

# A tibble: 6 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 out_b 12 12
2 pred2 11 11
3 pred1 10 10
4 out_q 8 8
5 pred3 6 6
6 subject 0 0
Suppose we fit the model without dealing with the missing data, so that we run:
Call:
lm(formula = out_q ~ pred1 + pred2 + pred3, data = sim1)
Residuals:
Min 1Q Median 3Q Max
-39.164 -13.900 2.419 15.541 34.156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.2070 18.6185 5.651 3.82e-07 ***
pred1 -0.8361 1.3010 -0.643 0.523
pred2 0.2611 0.4614 0.566 0.573
pred3Middle -1.3498 5.6802 -0.238 0.813
pred3High -2.7443 5.5427 -0.495 0.622
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.74 on 65 degrees of freedom
(30 observations deleted due to missingness)
Multiple R-squared: 0.01843, Adjusted R-squared: -0.04198
F-statistic: 0.3051 on 4 and 65 DF, p-value: 0.8736
How can we tell how many observations will be used?
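One quick way to check, sketched on toy data (not sim1): lm() silently drops incomplete rows, and nobs() reports the count actually used.

```r
# Toy data with one incomplete row in each variable.
d <- data.frame(y = c(1, 2, NA, 4, 5),
                x = c(1, NA, 3, 4, 5))

fit <- lm(y ~ x, data = d)

nobs(fit)               # rows actually used by the fit: 3
sum(complete.cases(d))  # the same count, computed directly: 3
```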
Analysis of Variance Table
Response: out_q
Df Sum Sq Mean Sq F value Pr(>F)
pred1 1 209.8 209.81 0.5976 0.4423
pred2 1 132.1 132.14 0.3763 0.5417
pred3 2 86.5 43.24 0.1231 0.8843
Residuals 65 22821.9 351.11
# A tibble: 1 × 6
r.squared adj.r.squared sigma statistic p.value df
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0184 -0.0420 18.7 0.305 0.874 4
Which observations were omitted? mod1$na.action lists them:

 3  4  5  6 13 16 19 26 27 29 30 34 39 48 51 56 62 66 67 68 72 75 81 83 86 89
93 94 96 97
attr(,"class")
[1] "omit"
An alternative na.action setting in lm is na.exclude, which pads out predicted values and residuals with NAs instead of omitting the 30 observations listed above.

lm(out_q ~ pred1 + pred2 + pred3,
   data = sim1, na.action = na.exclude)
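The practical difference is easiest to see on toy data (the names here are illustrative, not sim1):

```r
d <- data.frame(y = c(1, 2, NA, 4), x = c(1, 2, 3, 4))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit_omit))   # 3: the incomplete row is dropped
length(residuals(fit_excl))   # 4: padded with NA in position 3
```

The fits (coefficients, R-squared) are identical; only the lengths of fitted values and residuals differ, which matters when binding them back onto the original data.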
Comparing mod1 with na.omit and na.exclude

Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls
In this article, we review the reasons why missing data may lead to bias and loss of information in epidemiological and clinical research. We discuss the circumstances in which multiple imputation may help by reducing bias or increasing precision, as well as describing potential pitfalls in its application. Finally, we describe the recent use and reporting of analyses using multiple imputation in general medical journals, and suggest guidelines for the conduct and reporting of such analyses.
Note: The next 7 slides are derived from Sterne et al.
Consider, for example, a study investigating the association of systolic blood pressure with the risk of subsequent coronary heart disease, in which data on systolic blood pressure are missing for some people.
The probability that systolic blood pressure is missing is likely to depend on other observed characteristics, such as age, smoking, body mass index, and coronary heart disease status.
If we assume that data are missing at random and that we have systolic blood pressure data on a representative sample of individuals within strata of age, smoking, body mass index, and coronary heart disease, then we can use multiple imputation to estimate the overall association between systolic blood pressure and coronary heart disease.
“Missing at random” is an assumption that justifies the analysis, and is not a property of the data.
Sometimes, it is impossible to account for systematic differences between missing and observed values using the available data.
Where complete cases and multiple imputation analyses give different results, the analyst should attempt to understand why, and this should be reported in publications.
If we assume data are MAR, then unbiased and statistically more powerful analyses (compared with analyses based on complete cases) can generally be done by including individuals with incomplete data.
There are circumstances in which analyses of complete cases will not lead to bias.
Multiple imputation … aims to allow for the uncertainty about the missing data by creating several different plausible imputed data sets and appropriately combining results obtained from each of them.
The first stage is to create multiple copies of the dataset, with the missing values replaced by imputed values.
Note that single imputation of missing values usually causes standard errors to be too small, since it fails to account for the fact that we are uncertain about the missing values.
The second stage is to use standard statistical methods to fit the model of interest to each of the imputed datasets.
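The combining step ("Rubin's rules") can be sketched in a few lines of base R; the numbers below are made up purely for illustration:

```r
# Toy results: a point estimate and standard error from each of
# m = 3 analyses of imputed data sets (numbers assumed).
est <- c(1.02, 0.97, 1.05)
se  <- c(0.20, 0.22, 0.21)
m   <- length(est)

qbar <- mean(est)              # pooled point estimate
ubar <- mean(se^2)             # average within-imputation variance
b    <- var(est)               # between-imputation variance
tvar <- ubar + (1 + 1/m) * b   # total variance (Rubin's rules)

c(estimate = qbar, std.error = sqrt(tvar))
```

mice's pool() reports these same pieces as ubar, b, and t in its mipo output; note that the pooled standard error exceeds the average within-imputation standard error, reflecting the extra uncertainty from imputation.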
fram_raw <- read_csv("c22/data/framingham.csv", show_col_types = FALSE) |>
clean_names()
dim(fram_raw)

[1] 4238   17
[1] 645
| Variable | Description |
|---|---|
| educ | four-level factor: educational attainment |
| smoker | 1 = current smoker at examination time, else 0 |
| sbp | systolic blood pressure (mm Hg) |
| obese | 1 if subject's bmi is 30 or higher, else 0 |
| glucose | blood glucose level in mg/dl |
fram_sub <- fram_raw |>
mutate(educ = fct_recode(factor(education),
"Some HS" = "1",
"HS grad" = "2",
"Some Coll" = "3",
"Coll grad" = "4")) |>
mutate(obese = as.numeric(bmi >= 30)) |>
rename(smoker = "current_smoker",
sbp = "sys_bp") |>
mutate(subj_id = as.character(subj_id)) |>
select(sbp, educ, smoker, obese, glucose, subj_id)
dim(fram_sub)

[1] 4238    6
Use linear regression to predict sbp using two different models, in each case accounting for missingness via multiple imputation, where the predictors of interest are glucose, obese, educ, and smoker.
# A tibble: 6 × 14
sbp educ smoker obese glucose subj_id inv_sbp sbp_NA educ_NA smoker_NA
<dbl> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <fct> <fct> <fct>
1 106 Coll grad 0 0 77 1 9.43 !NA !NA !NA
2 121 HS grad 0 0 76 2 8.26 !NA !NA !NA
3 128. Some HS 1 0 70 3 7.84 !NA !NA !NA
4 150 Some Coll 1 0 103 4 6.67 !NA !NA !NA
5 130 Some Coll 1 0 85 5 7.69 !NA !NA !NA
6 180 HS grad 0 1 99 6 5.56 !NA !NA !NA
# ℹ 4 more variables: obese_NA <fct>, glucose_NA <fct>, subj_id_NA <fct>,
# inv_sbp_NA <fct>
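The *_NA columns above are "shadow" columns (presumably built with naniar's bind_shadow(), which creates one per variable); a minimal base-R sketch of what each shadow column records:

```r
# For one variable: a factor flagging missing ("NA") vs observed ("!NA"),
# mirroring naniar's shadow-column convention.
x <- c(106, 121, NA, 150)
x_NA <- factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))

table(x_NA)
```

Keeping the shadow alongside the data lets later summaries and plots condition on whether a value was originally missing, even after imputation.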
Model 2: predict 1000/sbp using glucose and obese.
Model 4: predict 1000/sbp using glucose, obese, educ, and smoker.
Suppose we ignore the missingness and just run the model on the data with complete information on inv_sbp, glucose and obese.
m2_cc <- with(fram_sub_sh, lm(inv_sbp ~ glucose + obese))
tidy(m2_cc, conf.int = TRUE, conf.level = 0.95) |> select(-statistic) |>
kable(digits = 3) |> kable_styling(font_size = 28)

| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 8.259 | 0.066 | 0 | 8.129 | 8.389 |
| glucose | -0.005 | 0.001 | 0 | -0.007 | -0.004 |
| obese | -0.719 | 0.056 | 0 | -0.828 | -0.610 |
Residual standard error: 1.14 on 3833 degrees of freedom
(402 observations deleted due to missingness)
Multiple R-squared: 0.05531, Adjusted R-squared: 0.05481
F-statistic: 112.2 on 2 and 3833 DF, p-value: < 2.2e-16
m4_cc <- lm(inv_sbp ~ glucose + obese + smoker + educ, data = fram_sub_sh)
tidy(m4_cc, conf.int = TRUE) |> select(-statistic) |>
kable(digits = 3) |> kable_styling(font_size = 28)

| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 7.967 | 0.074 | 0 | 7.822 | 8.111 |
| glucose | -0.005 | 0.001 | 0 | -0.006 | -0.003 |
| obese | -0.650 | 0.057 | 0 | -0.761 | -0.539 |
| smoker | 0.253 | 0.037 | 0 | 0.180 | 0.325 |
| educHS grad | 0.196 | 0.044 | 0 | 0.109 | 0.283 |
| educSome Coll | 0.251 | 0.054 | 0 | 0.146 | 0.357 |
| educColl grad | 0.317 | 0.062 | 0 | 0.196 | 0.438 |
Residual standard error: 1.126 on 3733 degrees of freedom
(498 observations deleted due to missingness)
Multiple R-squared: 0.07919, Adjusted R-squared: 0.07771
F-statistic: 53.5 on 6 and 3733 DF, p-value: < 2.2e-16
# A tibble: 7 × 3
variable n_miss pct_miss
<chr> <int> <dbl>
1 glucose 388 9.16
2 educ 105 2.48
3 obese 19 0.448
4 sbp 0 0
5 smoker 0 0
6 subj_id 0 0
7 inv_sbp 0 0
How many subjects have complete / missing data that affect this model?
Setting printFlag = FALSE eliminates a lot of unnecessary (and not particularly informative) output.

Class: mids
Number of multiple imputations: 15
Imputation methods:
sbp educ smoker obese glucose subj_id inv_sbp
"" "polyreg" "" "pmm" "pmm" "" ""
PredictorMatrix:
sbp educ smoker obese glucose subj_id inv_sbp
sbp 0 1 1 1 1 0 1
educ 1 0 1 1 1 0 1
smoker 1 1 0 1 1 0 1
obese 1 1 1 0 1 0 1
glucose 1 1 1 1 0 0 1
subj_id 1 1 1 1 1 0 1
Number of logged events: 1
it im dep meth out
1 0 0 constant subj_id
Default methods in mice include:

- pmm: predictive mean matching (the default choice for quantitative variables)
- logreg: logistic regression (default for binary categorical variables)
- polyreg: polytomous logistic regression (for nominal multi-categorical variables)
- polr: proportional odds logistic regression (for ordinal categories)

but there are cart methods and many others available, too.
This will store the fifth imputed data set in imp_5, via mice's complete() function (for example, imp_5 <- complete(fram_mice, 5); the mids object name fram_mice is assumed here).
> summary(m2_mods)
# A tibble: 45 × 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) 8.30 0.0623 133. 0 4238
2 glucose -0.00571 0.000728 -7.84 5.77e-15 4238
3 obese -0.709 0.0525 -13.5 1.27e-40 4238
4 (Intercept) 8.31 0.0626 133. 0 4238
5 glucose -0.00583 0.000733 -7.95 2.45e-15 4238
6 obese -0.708 0.0526 -13.5 1.50e-40 4238
# ... with 39 more rows
Consider working with the analysis done on the 4th imputed data set (of the 15 created)…
term estimate std.error statistic df p.value
1 (Intercept) 8.284586073 0.0676787686 122.410414 553.4504 0.000000e+00
2 glucose -0.005467924 0.0007996132 -6.838211 475.2087 2.473731e-11
3 obese -0.707891530 0.0526359301 -13.448827 4188.1675 2.132278e-40
2.5 % 97.5 %
1 8.151647406 8.417524740
2 -0.007039138 -0.003896709
3 -0.811085880 -0.604697181
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 8.259 | 0.066 | 124.78 | 0 | 8.129 | 8.389 |
| glucose | -0.005 | 0.001 | -6.72 | 0 | -0.007 | -0.004 |
| obese | -0.719 | 0.056 | -12.94 | 0 | -0.828 | -0.610 |
summary(m2_pool, conf.int = TRUE, conf.level = 0.95) |>
  select(-df) |> kable(digits = 3) |> kable_styling(font_size = 28)

| term | estimate | std.error | statistic | p.value | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| (Intercept) | 8.285 | 0.068 | 122.410 | 0 | 8.152 | 8.418 |
| glucose | -0.005 | 0.001 | -6.838 | 0 | -0.007 | -0.004 |
| obese | -0.708 | 0.053 | -13.449 | 0 | -0.811 | -0.605 |
Class: mipo m = 15
term m estimate ubar b t dfcom
1 (Intercept) 15 8.284586073 3.910023e-03 6.284932e-04 4.580416e-03 4235
2 glucose 15 -0.005467924 5.372312e-07 9.576573e-08 6.393813e-07 4235
3 obese 15 -0.707891530 2.758028e-03 1.173120e-05 2.770541e-03 4235
df riv lambda fmi
1 553.4504 0.17145493 0.146360670 0.149428830
2 475.2087 0.19014180 0.159763988 0.163278086
3 4188.1675 0.00453704 0.004516549 0.004991587
Definitions of these terms are in the mipo help file.
- riv = relative increase in variance attributable to non-response
- fmi = fraction of missing information due to non-response

# A tibble: 105 × 6
term estimate std.error statistic p.value nobs
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 (Intercept) 7.99 0.0687 116. 0 4238
2 glucose -0.00529 0.000721 -7.34 2.58e-13 4238
3 obese -0.625 0.0525 -11.9 3.83e-32 4238
4 smoker 0.239 0.0349 6.86 7.77e-12 4238
5 educHS grad 0.209 0.0415 5.05 4.61e- 7 4238
6 educSome Coll 0.292 0.0504 5.79 7.37e- 9 4238
7 educColl grad 0.341 0.0579 5.89 4.19e- 9 4238
8 (Intercept) 8.00 0.0695 115. 0 4238
9 glucose -0.00533 0.000727 -7.33 2.76e-13 4238
10 obese -0.628 0.0526 -11.9 2.50e-32 4238
# ℹ 95 more rows
m4_pool <- pool(m4_mods)
summary(m4_pool, conf.int = TRUE, conf.level = 0.95) |>
select(-df) |> kable(digits = 3) |> kable_styling(font_size = 28)

| term | estimate | std.error | statistic | p.value | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| (Intercept) | 7.976 | 0.074 | 107.943 | 0 | 7.831 | 8.121 |
| glucose | -0.005 | 0.001 | -6.389 | 0 | -0.007 | -0.003 |
| obese | -0.626 | 0.053 | -11.891 | 0 | -0.730 | -0.523 |
| smoker | 0.240 | 0.035 | 6.858 | 0 | 0.171 | 0.308 |
| educHS grad | 0.198 | 0.042 | 4.701 | 0 | 0.115 | 0.280 |
| educSome Coll | 0.285 | 0.051 | 5.583 | 0 | 0.185 | 0.385 |
| educColl grad | 0.328 | 0.059 | 5.555 | 0 | 0.212 | 0.443 |
tidy(m4_cc, conf.int = TRUE, conf.level = 0.95) |>
kable(digits = 3) |> kable_styling(font_size = 28)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 7.967 | 0.074 | 107.984 | 0 | 7.822 | 8.111 |
| glucose | -0.005 | 0.001 | -6.382 | 0 | -0.006 | -0.003 |
| obese | -0.650 | 0.057 | -11.472 | 0 | -0.761 | -0.539 |
| smoker | 0.253 | 0.037 | 6.794 | 0 | 0.180 | 0.325 |
| educHS grad | 0.196 | 0.044 | 4.421 | 0 | 0.109 | 0.283 |
| educSome Coll | 0.251 | 0.054 | 4.686 | 0 | 0.146 | 0.357 |
| educColl grad | 0.317 | 0.062 | 5.149 | 0 | 0.196 | 0.438 |
Class: mipo m = 15
term m estimate ubar b t dfcom
1 (Intercept) 15 7.976258626 4.773994e-03 6.433669e-04 5.460252e-03 4231
2 glucose 15 -0.005044811 5.269184e-07 9.055590e-08 6.235114e-07 4231
3 obese 15 -0.626379499 2.758532e-03 1.549957e-05 2.775065e-03 4231
4 smoker 15 0.239874783 1.219377e-03 3.749115e-06 1.223376e-03 4231
5 educHS grad 15 0.197535267 1.722712e-03 4.020911e-05 1.765602e-03 4231
6 educSome Coll 15 0.284731274 2.543737e-03 5.364879e-05 2.600963e-03 4231
7 educColl grad 15 0.327760325 3.354429e-03 1.189877e-04 3.481349e-03 4231
df riv lambda fmi
1 714.9262 0.143749251 0.125682488 0.128118163
2 501.4894 0.183316739 0.154917726 0.158267974
3 4159.4759 0.005993359 0.005957653 0.006435274
4 4201.6596 0.003279589 0.003268868 0.003742976
5 3514.9496 0.024896627 0.024291842 0.024846545
6 3618.4879 0.022496572 0.022001612 0.022541720
7 2938.1779 0.037836603 0.036457187 0.037112396
Pooled R-squared and adjusted R-squared for Model 2:

               est      lo 95      hi 95        fmi
R^2     0.05608137 0.04307754 0.07057275 0.04317449
adj R^2 0.05563553 0.04267941 0.07008282 0.04351396

Pooled R-squared and adjusted R-squared for Model 4:

               est      lo 95      hi 95        fmi
R^2     0.07876698 0.06365358 0.09519128 0.02679134
adj R^2 0.07746049 0.06245732 0.09378374 0.02723597
The models must be nested (same outcome, one set of predictors is a subset of the other) for this to be appropriate.
We’ll use the Wald test (mice's D1() procedure) after a linear regression fit.
test statistic df1 df2 dfcom p.value riv
1 ~~ 2 25.43738 4 4049.173 4231 7.634835e-21 0.0230498
Could also use a likelihood ratio test.
[Diagnostic plots of the mod4 fits from the 1st and 6th imputations appeared here.]

How should we report on analyses potentially affected by missing data?
How should we report on analyses that involve multiple imputation?
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Locale:
LC_COLLATE=English_United States.utf8
LC_CTYPE=English_United States.utf8
LC_MONETARY=English_United States.utf8
LC_NUMERIC=C
LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
Package version:
abind_1.4-5 askpass_1.2.0 backports_1.4.1
base64enc_0.1.3 bit_4.0.5 bit64_4.0.5
blob_1.2.4 boot_1.3-28.1 brio_1.1.3
broom_1.0.5 bslib_0.5.1 cachem_1.0.8
callr_3.7.3 car_3.1-2 carData_3.0-5
cellranger_1.1.0 cli_3.6.1 clipr_0.8.0
codetools_0.2-19 colorspace_2.1-0 compiler_4.3.1
conflicted_1.2.0 cpp11_0.4.6 crayon_1.5.2
curl_5.1.0 data.table_1.14.8 DBI_1.1.3
dbplyr_2.4.0 desc_1.4.2 diffobj_0.3.5
digest_0.6.33 dplyr_1.1.3 dtplyr_1.3.1
ellipsis_0.3.2 evaluate_0.22 fansi_1.0.5
farver_2.1.1 fastmap_1.1.1 fontawesome_0.5.2
forcats_1.0.0 foreach_1.5.2 fs_1.6.3
gargle_1.5.2 generics_0.1.3 GGally_2.1.2
ggplot2_3.4.4 glmnet_4.1-8 glue_1.6.2
googledrive_2.1.1 googlesheets4_1.1.1 graphics_4.3.1
grDevices_4.3.1 grid_4.3.1 gridExtra_2.3
gtable_0.3.4 haven_2.5.3 highr_0.10
hms_1.1.3 htmltools_0.5.6.1 httr_1.4.7
ids_1.0.1 isoband_0.2.7 iterators_1.0.14
janitor_2.2.0 jomo_2.7-6 jquerylib_0.1.4
jsonlite_1.8.7 kableExtra_1.3.4 knitr_1.44
labeling_0.4.3 lattice_0.21-8 lifecycle_1.0.3
lme4_1.1-34 lubridate_1.9.3 magrittr_2.0.3
MASS_7.3-60 Matrix_1.6-1.1 MatrixModels_0.5.2
memoise_2.0.1 methods_4.3.1 mgcv_1.8.42
mice_3.16.0 mime_0.12 minqa_1.2.6
mitml_0.4-5 modelr_0.1.11 munsell_0.5.0
naniar_1.0.0 nlme_3.1-162 nloptr_2.0.3
nnet_7.3-19 norm_1.0.11.1 numDeriv_2016.8.1.1
openssl_2.1.1 ordinal_2022.11.16 pan_1.9
parallel_4.3.1 pbkrtest_0.5.2 pillar_1.9.0
pkgbuild_1.4.2 pkgconfig_2.0.3 pkgload_1.3.3
plyr_1.8.9 praise_1.0.0 prettyunits_1.2.0
processx_3.8.2 progress_1.2.2 ps_1.7.5
purrr_1.0.2 quantreg_5.97 R6_2.5.1
ragg_1.2.6 rappdirs_0.3.3 RColorBrewer_1.1-3
Rcpp_1.0.11 RcppEigen_0.3.3.9.3 readr_2.1.4
readxl_1.4.3 rematch_2.0.0 rematch2_2.1.2
reprex_2.0.2 reshape_0.8.9 rlang_1.1.1
rmarkdown_2.25 rpart_4.1.19 rprojroot_2.0.3
rstudioapi_0.15.0 rvest_1.0.3 sass_0.4.7
scales_1.2.1 selectr_0.4.2 shape_1.4.6
snakecase_0.11.1 SparseM_1.81 splines_4.3.1
stats_4.3.1 stringi_1.7.12 stringr_1.5.0
survival_3.5-5 svglite_2.1.2 sys_3.4.2
systemfonts_1.0.5 testthat_3.2.0 textshaping_0.3.7
tibble_3.2.1 tidyr_1.3.0 tidyselect_1.2.0
tidyverse_2.0.0 timechange_0.2.0 tinytex_0.48
tools_4.3.1 tzdb_0.4.0 ucminf_1.2.0
UpSetR_1.4.0 utf8_1.2.3 utils_4.3.1
uuid_1.1.1 vctrs_0.6.4 viridis_0.6.4
viridisLite_0.4.2 visdat_0.6.0 vroom_1.6.4
waldo_0.5.1 webshot_0.5.5 withr_2.5.1
xfun_0.40 xml2_1.3.5 yaml_2.3.7
431 Class 22 | 2023-11-16 | https://thomaselove.github.io/431-2023/